Results 1 - 20 of 72
1.
IEEE Trans Image Process ; 33: 1560-1573, 2024.
Article in English | MEDLINE | ID: mdl-38358874

ABSTRACT

In this paper, we focus on the weakly supervised video object detection problem, where each training video is tagged only with object labels, without any bounding box annotations of objects. To effectively train object detectors from such weakly annotated videos, we propose a Progressive Frame-Proposal Mining (PFPM) framework that exploits discriminative proposals in a coarse-to-fine manner. First, we design a flexible Multi-Level Selection (MLS) scheme with explicit guidance from video tags. By selecting object-relevant frames and mining important proposals from these frames, the proposed MLS can effectively reduce frame redundancy as well as improve proposal effectiveness to boost weakly-supervised detectors. Moreover, we develop a novel Holistic-View Refinement (HVR) scheme, which can globally evaluate the importance of proposals across frames and thus correctly refine pseudo ground-truth boxes for training video detectors in a self-supervised manner. Finally, we evaluate the proposed PFPM on ImageNet VID, a large-scale benchmark for video object detection, under the weak-annotation setting. The experimental results demonstrate that our PFPM significantly outperforms state-of-the-art weakly-supervised detectors.
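The coarse-to-fine mining idea can be sketched as a toy two-stage filter, assuming per-frame tag-relevance scores and per-proposal objectness scores are available (all names, scores, and budgets here are illustrative, not taken from the paper):

```python
def mine_proposals(frames, frame_keep=2, proposal_keep=2):
    """Coarse-to-fine mining: keep the frames most relevant to the video
    tag, then keep the highest-scoring proposals inside those frames."""
    # frames: list of {"tag_score": float, "proposals": [(score, box), ...]}
    top_frames = sorted(frames, key=lambda f: f["tag_score"], reverse=True)[:frame_keep]
    mined = []
    for f in top_frames:
        best = sorted(f["proposals"], key=lambda p: p[0], reverse=True)[:proposal_keep]
        mined.extend(best)
    return mined

frames = [
    {"tag_score": 0.9, "proposals": [(0.8, "a"), (0.2, "b"), (0.6, "c")]},
    {"tag_score": 0.1, "proposals": [(0.9, "d")]},  # tag-irrelevant frame, dropped
    {"tag_score": 0.7, "proposals": [(0.5, "e"), (0.4, "f")]},
]
mined = mine_proposals(frames)
```

In the paper, proposals surviving such a filter would then be refined into pseudo ground truth for the next training round.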

2.
Nat Genet ; 56(1): 136-142, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38082204

ABSTRACT

Most fresh bananas belong to the Cavendish and Gros Michel subgroups. Here, we report chromosome-scale genome assemblies of Cavendish (1.48 Gb) and Gros Michel (1.33 Gb), defining three subgenomes, Ban, Dh and Ze, with Musa acuminata ssp. banksii, malaccensis and zebrina as their major ancestral contributors, respectively. The insertion of repeat sequences in the Fusarium oxysporum f. sp. cubense (Foc) tropical race 4 RGA2 (resistance gene analog 2) promoter was identified in most diploid and triploid bananas. We found that the receptor-like protein (RLP) locus, including Foc race 1-resistant genes, is absent in the Gros Michel Ze subgenome. We identified two NAP (NAC-like, activated by apetala3/pistillata) transcription factor homologs specifically and highly expressed in fruit that directly bind to the promoters of many fruit ripening genes and may be key regulators of fruit ripening. Our genome data should facilitate the breeding and super-domestication of bananas.


Subjects
Fusarium, Musa, Musa/genetics, Fusarium/genetics, Triploidy, Plant Breeding, Transcription Factors/genetics, Plant Diseases/genetics
3.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 2722-2740, 2024 May.
Article in English | MEDLINE | ID: mdl-37988208

ABSTRACT

Neural Architecture Search (NAS), aiming at automatically designing neural architectures by machines, has been considered a key step toward automatic machine learning. One notable NAS branch is weight-sharing NAS, which significantly improves search efficiency and allows NAS algorithms to run on ordinary computers. Despite receiving high expectations, this category of methods suffers from low search effectiveness. By employing a generalization boundedness tool, we demonstrate that the devil behind this drawback is the untrustworthy architecture rating caused by the oversized search space of possible architectures. Addressing this problem, we modularize a large search space into blocks with small search spaces and develop a family of models with the distilling neural architecture (DNA) techniques. These proposed models, namely a DNA family, are capable of resolving multiple dilemmas of weight-sharing NAS, such as scalability, efficiency, and multi-modal compatibility. Our proposed DNA models can rate all architecture candidates, as opposed to previous works that can only access a sub-search space using heuristic algorithms. Moreover, under a certain computational complexity constraint, our method can seek architectures with different depths and widths. Extensive experimental evaluations show that our models achieve state-of-the-art top-1 accuracy of 78.9% and 83.6% on ImageNet for a mobile convolutional network and a small vision transformer, respectively. Additionally, we provide in-depth empirical analysis and insights into neural architecture ratings.


Subjects
Algorithms, Machine Learning, Plant Extracts, DNA
4.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 14144-14160, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37669202

ABSTRACT

Partial person re-identification (ReID) aims to solve the problem of image spatial misalignment due to occlusions or out-of-view body parts. Despite significant progress through the introduction of additional information, such as human pose landmarks, mask maps, and spatial information, partial person ReID remains challenging due to noisy keypoints and easily perturbed pedestrian representations. To address these issues, we propose a unified attribute-guided collaborative learning scheme for partial person ReID. Specifically, we introduce an adaptive threshold-guided masked graph convolutional network that can dynamically remove untrustworthy edges to suppress the diffusion of noisy keypoints. Furthermore, we incorporate human attributes and devise a cyclic heterogeneous graph convolutional network to effectively fuse cross-modal pedestrian information through intra- and inter-graph interaction, resulting in robust pedestrian representations. Finally, to enhance keypoint representation learning, we design a novel part-based similarity constraint based on the axisymmetric characteristic of the human body. Extensive experiments on multiple public datasets show that our model achieves superior performance compared with other state-of-the-art baselines.
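The edge-masking step can be illustrated with a small stand-in for the paper's network: edges whose confidence falls below a threshold are zeroed out before the graph convolution, so noisy keypoints stop propagating. Here the threshold is a fixed number rather than the adaptive, learned one:

```python
import numpy as np

def masked_adjacency(A, conf, tau):
    """Mask graph edges whose confidence falls below threshold tau,
    then renormalize rows so the graph convolution stays a weighted average.
    A: adjacency matrix, conf: per-edge confidence, tau: threshold."""
    mask = (conf >= tau).astype(A.dtype)
    A_masked = A * mask
    deg = A_masked.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0  # avoid division by zero for isolated nodes
    return A_masked / deg

A = np.ones((3, 3))                    # fully connected keypoint graph
conf = np.array([[1.0, 0.9, 0.2],
                 [0.9, 1.0, 0.1],
                 [0.2, 0.1, 1.0]])     # keypoint 2 looks noisy
A_hat = masked_adjacency(A, conf, tau=0.5)
```

After masking, the noisy keypoint no longer contributes to the aggregated features of the other two.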

5.
IEEE Trans Image Process ; 32: 4951-4963, 2023.
Article in English | MEDLINE | ID: mdl-37643102

ABSTRACT

Weakly supervised person search involves training a model with only bounding box annotations, without human-annotated identities. Clustering algorithms are commonly used to assign pseudo-labels to facilitate this task. However, inaccurate pseudo-labels and imbalanced identity distributions can result in severe label and sample noise. In this work, we propose a novel Collaborative Contrastive Refining (CCR) weakly-supervised framework for person search that jointly refines pseudo-labels and the sample-learning process with different contrastive strategies. Specifically, we adopt a hybrid contrastive strategy that leverages both visual and context clues to refine pseudo-labels, and leverage the sample-mining and noise-contrastive strategy to reduce the negative impact of imbalanced distributions by distinguishing positive samples and noise samples. Our method brings two main advantages: 1) it facilitates better clustering results for refining pseudo-labels by exploring the hybrid similarity; 2) it is better at distinguishing query samples and noise samples for refining the sample-learning process. Extensive experiments demonstrate the superiority of our approach over the state-of-the-art weakly supervised methods by a large margin (more than 3% mAP on CUHK-SYSU). Moreover, by leveraging more diverse unlabeled data, our method achieves comparable or even better performance than the state-of-the-art supervised methods.
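A minimal sketch of the hybrid-similarity idea, assuming precomputed visual and context similarities for a pair of detections; the weighting and threshold below are illustrative choices, not the paper's settings:

```python
def hybrid_similarity(vis_sim, ctx_sim, alpha=0.7):
    """Blend visual and context similarity before clustering;
    alpha is an illustrative weighting, not the paper's."""
    return alpha * vis_sim + (1 - alpha) * ctx_sim

def same_identity(vis_sim, ctx_sim, alpha=0.7, tau=0.6):
    """Two detections receive the same pseudo-label only if the
    hybrid similarity clears a threshold."""
    return hybrid_similarity(vis_sim, ctx_sim, alpha) >= tau
```

In the full framework, clustering on such a blended similarity produces the pseudo-labels that are then refined contrastively.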

6.
IEEE Trans Image Process ; 32: 5126-5137, 2023.
Article in English | MEDLINE | ID: mdl-37643103

ABSTRACT

The goal of camouflaged object detection (COD) is to detect objects that are visually embedded in their surroundings. Existing COD methods only focus on detecting camouflaged objects from seen classes, and they suffer from performance degradation when detecting unseen classes. However, in a real-world scenario, collecting sufficient data for seen classes is extremely difficult and labeling them requires high professional skills, thereby making these COD methods not applicable. In this paper, we propose a new zero-shot COD framework (termed ZSCOD), which can effectively detect unseen classes. Specifically, our framework includes a Dynamic Graph Searching Network (DGSNet) and a Camouflaged Visual Reasoning Generator (CVRG). In detail, DGSNet is proposed to adaptively capture more edge details for boosting the COD performance. CVRG is utilized to produce pseudo-features that are closer to the real features of the seen camouflaged objects, which can transfer knowledge from seen classes to unseen classes to help detect unseen objects. In addition, our graph reasoning is built on a dynamic searching strategy, which can pay more attention to the boundaries of objects to reduce the influence of the background. More importantly, we construct the first zero-shot COD benchmark based on the COD10K dataset. Experimental results on public datasets show that our ZSCOD not only detects camouflaged objects of unseen classes but also achieves state-of-the-art performance in detecting seen classes.

7.
Article in English | MEDLINE | ID: mdl-37163401

ABSTRACT

Convolutional neural networks (CNNs) have been successfully applied to the single target tracking task in recent years. Generally, training a deep CNN model requires numerous labeled training samples, and the number and quality of these samples directly affect the representational capability of the trained model. However, this approach is restrictive in practice, because manually labeling such a large number of training samples is time-consuming and prohibitively expensive. In this article, we propose an active learning method for deep visual tracking, which selects and annotates the unlabeled samples to train the deep CNN model. Under the guidance of active learning, the tracker based on the trained deep CNN model can achieve competitive tracking performance while reducing the labeling cost. More specifically, to ensure the diversity of selected samples, we propose an active learning method based on multiframe collaboration to select the training samples that most need to be annotated. Meanwhile, considering the representativeness of these selected samples, we adopt a nearest-neighbor discrimination method based on the average nearest-neighbor distance to screen out isolated and low-quality samples. Therefore, the subset of training samples selected by our method requires only a given budget to maintain the diversity and representativeness of the entire sample set. Furthermore, we adopt a Tversky loss to improve the bounding box estimation of our tracker, which ensures that the tracker achieves more accurate target states. Extensive experimental results confirm that our active-learning-based tracker (ALT) achieves competitive tracking accuracy and speed compared with state-of-the-art trackers on the seven most challenging evaluation benchmarks. Project website: https://sites.google.com/view/altrack/.
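The Tversky loss mentioned above has a standard form, 1 - TP / (TP + α·FP + β·FN), where α and β trade off false positives against false negatives; a minimal soft-mask version (the tracker's exact usage for box estimation may differ):

```python
import numpy as np

def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky loss on soft binary masks: 1 - TP / (TP + alpha*FP + beta*FN).
    With alpha = beta = 0.5 it reduces to the Dice loss."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    tp = (pred * target).sum()
    fp = (pred * (1 - target)).sum()
    fn = ((1 - pred) * target).sum()
    return 1.0 - tp / (tp + alpha * fp + beta * fn + eps)
```

Setting β > α penalizes missed target pixels more than spurious ones, which is why the loss is attractive for tight target-state estimation.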

8.
Article in English | MEDLINE | ID: mdl-37021992

ABSTRACT

Unlike visual object tracking, thermal infrared (TIR) object tracking methods can track the target of interest in poor visibility such as rain, snow, and fog, or even in total darkness. This feature brings a wide range of application prospects for TIR object tracking methods. However, this field lacks a unified and large-scale training and evaluation benchmark, which has severely hindered its development. To this end, we present a large-scale and high-diversity unified TIR single object tracking benchmark, called LSOTB-TIR, which consists of a tracking evaluation dataset and a general training dataset with a total of 1416 TIR sequences and more than 643K frames. We annotate the bounding box of objects in every frame of all sequences and generate over 770K bounding boxes in total. To the best of our knowledge, LSOTB-TIR is the largest and most diverse TIR object tracking benchmark to date. We split the evaluation dataset into a short-term tracking subset and a long-term tracking subset to evaluate trackers using different paradigms. Moreover, to evaluate a tracker on different attributes, we also define four scenario attributes and 12 challenge attributes in the short-term tracking evaluation subset. By releasing LSOTB-TIR, we encourage the community to develop deep learning-based TIR trackers and evaluate them fairly and comprehensively. We evaluate and analyze 40 trackers on LSOTB-TIR to provide a series of baselines and give some insights and future research directions in TIR object tracking. Furthermore, we retrain several representative deep trackers on LSOTB-TIR, and their results demonstrate that the proposed training dataset significantly improves the performance of deep TIR trackers. Codes and dataset are available at https://github.com/QiaoLiuHit/LSOTB-TIR.

9.
Article in English | MEDLINE | ID: mdl-37027761

ABSTRACT

Recently, tremendous human-designed and automatically searched neural networks have been applied to image denoising. However, previous works intend to handle all noisy images in a pre-defined static network architecture, which inevitably leads to high computational complexity for good denoising quality. Here, we present a dynamic slimmable denoising network (DDS-Net), a general method to achieve good denoising quality with less computational complexity, via dynamically adjusting the channel configurations of networks at test time with respect to different noisy images. Our DDS-Net is empowered with the ability of dynamic inference by a dynamic gate, which can predictively adjust the channel configuration of networks with negligible extra computation cost. To ensure the performance of each candidate sub-network and the fairness of the dynamic gate, we propose a three-stage optimization scheme. In the first stage, we train a weight-shared slimmable super network. In the second stage, we evaluate the trained slimmable super network in an iterative way and progressively tailor the channel numbers of each layer with minimal denoising quality drop. By a single pass, we can obtain several sub-networks with good performance under different channel configurations. In the last stage, we identify easy and hard samples in an online way and train a dynamic gate to predictively select the corresponding sub-network with respect to different noisy images. Extensive experiments demonstrate our DDS-Net consistently outperforms the state-of-the-art individually trained static denoising networks.
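The gate's role can be sketched with a toy rule that maps a cheap noise estimate to a channel-width multiplier. The real gate is a small learned network trained in the third stage; the statistic, thresholds, and widths below are made up for illustration:

```python
import numpy as np

def dynamic_gate(noisy_patch, widths=(0.25, 0.5, 1.0), thresholds=(5.0, 15.0)):
    """Toy dynamic gate: pick a channel-width multiplier from a rough
    noise-level estimate (std of the mean-removed patch)."""
    residual = noisy_patch - np.mean(noisy_patch)
    noise_level = residual.std()
    if noise_level < thresholds[0]:
        return widths[0]   # easy sample -> slim sub-network
    if noise_level < thresholds[1]:
        return widths[1]
    return widths[2]       # hard sample -> full network

easy = dynamic_gate(np.zeros((4, 4)))                         # flat patch
hard = dynamic_gate(np.array([[0.0, 40.0], [40.0, 0.0]]))     # high-variance patch
```

The point of the scheme is that the gate's decision costs far less than the channels it saves.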

10.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 10555-10579, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37028387

ABSTRACT

Object detection (OD) is a crucial computer vision task that has seen the development of many algorithms and models over the years. While the performance of current OD models has improved, they have also become more complex, making them impractical for industry applications due to their large parameter size. To tackle this problem, knowledge distillation (KD) technology was proposed in 2015 for image classification and subsequently extended to other visual tasks due to its ability to transfer knowledge learned by complex teacher models to lightweight student models. This paper presents a comprehensive survey of KD-based OD models developed in recent years, with the aim of providing researchers with an overview of recent progress in the field. We conduct an in-depth analysis of existing works, highlighting their advantages and limitations, and explore future research directions to inspire the design of models for related tasks. We summarize the basic principles of designing KD-based OD models, describe related KD-based OD tasks, including performance improvements for lightweight models, catastrophic forgetting in incremental OD, small object detection, and weakly/semi-supervised OD. We also analyze novel distillation techniques, i.e. different types of distillation loss, feature interaction between teacher and student models, etc. Additionally, we provide an overview of the extended applications of KD-based OD models on specific datasets, such as remote sensing images and 3D point cloud datasets. We compare and analyze the performance of different models on several common datasets and discuss promising directions for solving specific OD problems.
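The original response-based KD recipe surveyed here combines a hard-label loss with a temperature-softened KL term between teacher and student outputs; a minimal NumPy sketch (temperature and weighting are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, label, T=4.0, lam=0.5):
    """Response-based distillation (after Hinton et al., 2015):
    lam * cross-entropy(hard label) + (1 - lam) * T^2 * KL(teacher || student)."""
    p_s, p_t = softmax(student_logits, T), softmax(teacher_logits, T)
    kl = float(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))))
    ce = -float(np.log(softmax(student_logits)[label] + 1e-12))
    return lam * ce + (1 - lam) * (T ** 2) * kl
```

KD-based detectors extend this idea beyond classification logits to features, regions, and relations, which is exactly the design space the survey maps out.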


Subjects
Algorithms, Learning, Humans
11.
Article in English | MEDLINE | ID: mdl-37018641

ABSTRACT

Unsupervised feature selection chooses a subset of discriminative features to reduce feature dimension under the unsupervised learning paradigm. Although many efforts have been made so far, existing solutions perform feature selection either without any label guidance or with only single pseudo-label guidance. They may cause significant information loss and semantic shortage of the selected features, since many real-world data, such as images and videos, are generally annotated with multiple labels. In this paper, we propose a new Unsupervised Adaptive Feature Selection with Binary Hashing (UAFS-BH) model, which learns binary hash codes as weakly-supervised multi-labels and simultaneously exploits the learned labels to guide feature selection. Specifically, in order to exploit the discriminative information under the unsupervised scenario, the weakly-supervised multi-labels are learned automatically by imposing binary hash constraints on the spectral embedding process to guide the ultimate feature selection. The number of weakly-supervised multi-labels (the number of "1"s in the binary hash codes) is adaptively determined according to the specific data content. Further, to enhance the discriminative capability of the binary labels, we model the intrinsic data structure by adaptively constructing a dynamic similarity graph. Finally, we extend UAFS-BH to the multi-view setting as Multi-view Feature Selection with Binary Hashing (MVFS-BH) to handle the multi-view feature selection problem. An effective binary optimization method based on the Augmented Lagrange Multiplier (ALM) is derived to iteratively solve the formulated problem. Extensive experiments on widely tested benchmarks demonstrate the state-of-the-art performance of the proposed method on both single-view and multi-view feature selection tasks. For the purpose of reproducibility, we provide the source codes and testing datasets at https://github.com/shidan0122/UMFS.git.
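The overall flow, binarize an embedding into weak multi-labels and then score features against them, can be sketched as follows. The real model learns the embedding and hash codes jointly; this stand-in simply thresholds a given embedding and ranks features by a correlation score:

```python
import numpy as np

def binary_pseudo_labels(embedding):
    """Toy stand-in for the binary-hash step: threshold a (here, given)
    spectral embedding at zero to obtain weakly-supervised multi-labels."""
    return (np.asarray(embedding) > 0).astype(int)

def feature_scores(X, B):
    """Score each feature by its absolute correlation with the binary
    label matrix; higher-scoring columns would be kept."""
    Xc = X - X.mean(axis=0)
    Bc = B - B.mean(axis=0)
    return np.abs(Xc.T @ Bc).sum(axis=1)

B = binary_pseudo_labels([[1, -1], [-1, 1], [1, -1]])
X = np.array([[1.0, 5.0],   # feature 0 varies with the labels
              [0.0, 5.0],   # feature 1 is constant (uninformative)
              [1.0, 5.0]])
s = feature_scores(X, B)
```

The informative feature outranks the constant one, which is the behavior the learned binary labels are meant to induce.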

12.
Article in English | MEDLINE | ID: mdl-37030854

ABSTRACT

With the rapid progress of deep neural network (DNN) applications on memristive platforms, there has been a growing interest in the acceleration and compression of memristive networks. As an emerging model optimization technique for memristive platforms, bit-level sparsity training (with the fixed-point quantization) can significantly reduce the demand for analog-to-digital converters (ADCs) resolution, which is critical for energy and area consumption. However, the bit sparsity and the fixed-point quantization will inevitably lead to a large performance loss. Different from the existing training and optimization techniques, this work attempts to explore more sparsity-tolerant architectures to compensate for performance degradation. We first empirically demonstrate that in a certain search space (e.g., 4-bit quantized DARTS space), network architectures differ in bit-level sparsity tolerance. It is reasonable and necessary to search the architectures for efficient deployment on memristive platforms by the neural architecture search (NAS) technology. We further introduce bit-level sparsity-tolerant NAS (BST-NAS), which encapsulates low-precision quantization and bit-level sparsity training into the differentiable NAS, to explore the optimal bit-level sparsity-tolerant architectures. Experimentally, with the same degree of sparsity and experiment settings, our searched architectures obtain a promising performance, which outperform the normal NAS-based DARTS-series architectures (about 5.8% higher than that of DARTS-V2 and 2.7% higher than that of PC-DARTS) on CIFAR10.
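Bit-level sparsity itself is easy to pin down: the fraction of zero bits in the fixed-point codes of the weights. A toy measure for unsigned 4-bit weights in [0, 1] (the signed-weight and quantizer details in the paper will differ):

```python
def bit_sparsity(weights, bits=4):
    """Fraction of zero bits in an unsigned fixed-point representation of
    weights in [0, 1]; higher bit sparsity allows lower ADC resolution on
    memristive crossbars."""
    total, zeros = 0, 0
    for w in weights:
        q = int(round(w * (2 ** bits - 1)))  # quantize to a 4-bit code
        for b in range(bits):
            total += 1
            zeros += ((q >> b) & 1) == 0
    return zeros / total
```

BST-NAS searches for architectures that keep accuracy when training pushes this fraction high.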

13.
J Exp Bot ; 74(4): 1275-1290, 2023 02 13.
Article in English | MEDLINE | ID: mdl-36433929

ABSTRACT

Jasminum sambac is a plant well known for its attractive and exceptional fragrance, the flowers of which are used to produce scented tea. Jasmonate (JA), an important plant hormone, was first identified in Jasminum species. Jasmine plants naturally contain abundant JA, the molecular mechanisms of whose synthesis and accumulation are not clearly understood. Here, we report a telomere-to-telomere consensus assembly of a double-petal J. sambac genome along with two haplotype-resolved genomes. We found that gain and loss, positive selection, and allele-specific expression of aromatic volatile-related genes contributed to the stronger flower fragrance in double-petal J. sambac compared with single- and multi-petal jasmines. Through comprehensive comparative genomic, transcriptomic, and metabolomic analyses of double-petal J. sambac, we revealed the genetic basis of the production of aromatic volatiles and salicylic acid (SA), and the accumulation of JA under non-stress conditions. We identified several key genes associated with JA biosynthesis, whose non-stress-related activities lead to extraordinarily high concentrations of JA in tissues. High JA synthesis coupled with low degradation in J. sambac results in the accumulation of high JA levels under typical environmental conditions, similar to the accumulation mechanism of SA. This study offers important insights into the biology of J. sambac, and provides valuable genomic resources for further utilization of natural products.


Subjects
Jasminum, Jasminum/genetics, Gene Expression Profiling, Transcriptome, Odorants
14.
IEEE Trans Pattern Anal Mach Intell ; 45(4): 4430-4446, 2023 Apr.
Article in English | MEDLINE | ID: mdl-35895643

ABSTRACT

Dynamic networks have shown promising capability in reducing theoretical computation complexity by adapting their architectures to the input during inference. However, their practical runtime usually lags behind the theoretical acceleration due to inefficient sparsity. In this paper, we explore a hardware-efficient dynamic inference regime, named dynamic weight slicing, which generalizes well across multiple dimensions in both CNNs and transformers (e.g., kernel size, embedding dimension, number of heads, etc.). Instead of adaptively selecting important weight elements in a sparse way, we pre-define dense weight slices with different importance levels by nested residual learning. During inference, weights are progressively sliced, beginning with the most important elements and moving to less important ones, to achieve different model capacities for inputs with diverse difficulty levels. Based on this conception, we present DS-CNN++ and DS-ViT++, by carefully designing the double-headed dynamic gate and the overall network architecture. We further propose dynamic idle slicing to address the drastic reduction of embedding dimension in DS-ViT++. To ensure sub-network generality and routing fairness, we propose a disentangled two-stage optimization scheme. In Stage I, in-place bootstrapping (IB) and multi-view consistency (MvCo) are proposed to stabilize and improve the training of the DS-CNN++ and DS-ViT++ supernets, respectively. In Stage II, sandwich gate sparsification (SGS) is proposed to assist the gate training. Extensive experiments on 4 datasets and 3 different network architectures demonstrate that our methods consistently outperform state-of-the-art static and dynamic model compression methods by a large margin (up to 6.6%). Typically, we achieve 2-4× computation reduction and up to 61.5% real-world acceleration on MobileNet, ResNet-50, and Vision Transformer, with minimal accuracy drops on ImageNet. Code release: https://github.com/changlin31/DS-Net.
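The core slicing operation is simple once channels are pre-ordered by importance: a sub-network is just a leading slice of each dense weight tensor, which is why it runs efficiently on real hardware. A toy sketch for one layer (illustrative, not the released implementation):

```python
import numpy as np

def slice_weights(W, keep_ratio):
    """Dynamic weight slicing (toy): output channels are assumed to be
    ordered by importance, so a sub-network is the leading slice of the
    dense weight matrix - a contiguous, hardware-friendly block."""
    out_ch = max(1, int(W.shape[0] * keep_ratio))
    return W[:out_ch, :]

W = np.arange(12.0).reshape(4, 3)   # 4 output channels, importance-ordered
sub = slice_weights(W, 0.5)         # half-capacity sub-network
```

Because every sub-network shares the same leading weights, the gate only has to choose a keep ratio per input.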

15.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3918-3932, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35679386

ABSTRACT

The main challenge in the field of unsupervised machine translation (UMT) is to associate source-target sentences in the latent space. As people who speak different languages share biologically similar visual systems, various unsupervised multi-modal machine translation (UMMT) models have been proposed to improve the performance of UMT by employing visual content in natural images to facilitate alignment. Relational information is commonly an important part of a sentence's semantics. Compared with images, videos can better present the interactions between objects and the ways in which an object transforms over time. However, current state-of-the-art methods only explore scene-level or object-level information from images without explicitly modeling object relations; thus, they are sensitive to spurious correlations, which poses a new challenge for UMMT models. In this paper, we employ a spatial-temporal graph obtained from videos to exploit object interactions in space and time for disambiguation purposes and to promote latent space alignment in UMMT. Our model employs multi-modal back-translation and features pseudo-visual pivoting, in which we learn a shared multilingual visual-semantic embedding space and incorporate visually pivoted captioning as additional weak supervision. Experimental results on the VATEX Translation 2020 and HowToWorld datasets validate the translation capabilities of our model at both the sentence level and the word level, and show that it generalizes well when videos are not available during the testing phase.

16.
IEEE Trans Pattern Anal Mach Intell ; 45(1): 1-26, 2023 Jan.
Article in English | MEDLINE | ID: mdl-34941499

ABSTRACT

A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene. As computer vision technology continues to develop, people are no longer satisfied with simply detecting and recognizing objects in images; instead, people look forward to a higher level of understanding and reasoning about visual scenes. For example, given an image, we want to not only detect and recognize objects in the image, but also understand the relationships between objects (visual relationship detection), and generate a text description (image captioning) based on the image content. Alternatively, we might want the machine to tell us what the little girl in the image is doing (Visual Question Answering (VQA)), or even remove the dog from the image and find similar images (image editing and retrieval), etc. These tasks require a higher level of understanding and reasoning for image vision tasks. The scene graph is just such a powerful tool for scene understanding. Therefore, scene graphs have attracted the attention of a large number of researchers, and related research is often cross-modal, complex, and rapidly developing. However, no relatively systematic survey of scene graphs exists at present. To this end, this survey conducts a comprehensive investigation of current scene graph research. More specifically, we first summarize the general definition of the scene graph, then conduct a comprehensive and systematic discussion on methods for scene graph generation (SGG) and on SGG with the aid of prior knowledge. We then investigate the main applications of scene graphs and summarize the most commonly used datasets. Finally, we provide some insights into the future development of scene graphs.
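The structure being surveyed can be captured in a few lines: objects carry attributes, and directed (subject, predicate, object) triples encode relationships. A minimal illustrative container (the girl/dog example follows the abstract):

```python
class SceneGraph:
    """Minimal scene-graph container: objects with attribute lists, plus
    (subject, predicate, object) relationship triples."""

    def __init__(self):
        self.objects = {}      # name -> list of attributes
        self.relations = []    # (subject, predicate, object) triples

    def add_object(self, name, attributes=()):
        self.objects[name] = list(attributes)

    def add_relation(self, subj, pred, obj):
        # both endpoints must already be objects in the graph
        assert subj in self.objects and obj in self.objects
        self.relations.append((subj, pred, obj))

g = SceneGraph()
g.add_object("girl", ["little"])
g.add_object("dog", ["brown"])
g.add_relation("girl", "petting", "dog")
```

Downstream tasks such as VQA or retrieval then query this graph instead of raw pixels.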

17.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3848-3861, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35709117

ABSTRACT

An integral part of video analysis and surveillance is temporal activity detection, which means to simultaneously recognize and localize activities in long untrimmed videos. Currently, the most effective methods of temporal activity detection are based on deep learning, and they typically perform very well with large scale annotated videos for training. However, these methods are limited in real applications due to the unavailable videos about certain activity classes and the time-consuming data annotation. To solve this challenging problem, we propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training still need to be detected. We design an end-to-end deep transferable network TN-ZSTAD as the architecture for this solution. On the one hand, this network utilizes an activity graph transformer to predict a set of activity instances that appear in the video, rather than produces many activity proposals in advance. On the other hand, this network captures the common semantics of seen and unseen activities from their corresponding label embeddings, and it is optimized with an innovative loss function that considers the classification property on seen activities and the transfer property on unseen activities together. Experiments on the THUMOS'14, Charades, and ActivityNet datasets show promising performance in terms of detecting unseen activities.

18.
World Wide Web ; 26(1): 253-270, 2023.
Article in English | MEDLINE | ID: mdl-36060430

ABSTRACT

Medical reports have significant clinical value to radiologists and specialists, especially during a pandemic like COVID. However, beyond the common difficulties faced in natural image captioning, medical report generation specifically requires the model to describe a medical image with a fine-grained and semantically coherent paragraph that satisfies both medical commonsense and logic. Previous works generally extract global image features and attempt to generate a paragraph similar to referenced reports; however, this approach has two limitations. Firstly, the regions of primary interest to radiologists are usually located in a small area of the global image, meaning that the remaining parts of the image could be considered irrelevant noise in the training procedure. Secondly, many similar sentences are used in each medical report to describe the normal regions of the image, which causes serious data bias. This deviation is likely to teach models to generate these inessential sentences on a regular basis. To address these problems, we propose an Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) to mimic radiologists' working patterns. Specifically, auxiliary patches are explored to expand the widely used visual patch features before being fed to the Transformer encoder, while external linguistic signals help the decoder better master prior knowledge during the pre-training process. Our approach performs well on common benchmarks, including the CX-CHR, IU X-Ray, and COVID-19 CT Report (COV-CTR) datasets, demonstrating that combining auxiliary signals with a Transformer architecture can bring a significant improvement in medical report generation. The experimental results confirm that auxiliary-signal-driven Transformer-based models have solid capabilities, outperforming previous approaches on both medical terminology classification and paragraph generation metrics.

19.
IEEE Trans Image Process ; 31: 4733-4745, 2022.
Article in English | MEDLINE | ID: mdl-35793293

ABSTRACT

Fashion Compatibility Modeling (FCM), which aims to automatically evaluate whether a given set of fashion items makes a compatible outfit, has attracted increasing research attention. Recent studies have demonstrated the benefits of conducting item representation disentanglement for FCM. Although these efforts have achieved prominent progress, they still perform unsatisfactorily, as they mainly investigate the visual content of fashion items, while overlooking the semantic attributes of items (e.g., color and pattern), which could largely boost the model performance and interpretability. To address this issue, we propose to comprehensively explore the visual content and attributes of fashion items for FCM. This problem is non-trivial considering the following challenges: a) how to utilize the irregular attribute labels of items to partially supervise the attribute-level representation learning of fashion items; b) how to ensure the intact disentanglement of attribute-level representations; and c) how to effectively sew together the multiple granularities of information (i.e., coarse-grained item-level and fine-grained attribute-level) to enable performance improvement and interpretability. To address these challenges, in this work, we present a partially supervised outfit compatibility modeling scheme (PS-OCM). In particular, we first devise a partially supervised attribute-level embedding learning component to disentangle the fine-grained attribute embeddings from the entire visual feature of each item. We then introduce a disentangled completeness regularizer to prevent information loss during disentanglement. Thereafter, we design a hierarchical graph convolutional network, which seamlessly integrates attribute- and item-level compatibility modeling, and enables explainable compatibility reasoning. Extensive experiments on a real-world dataset demonstrate that our PS-OCM significantly outperforms the state-of-the-art baselines.
We have released our source codes and well-trained models to benefit other researchers (https://site2750.wixsite.com/ps-ocm).

20.
Article in English | MEDLINE | ID: mdl-34982675

ABSTRACT

Zero-shot object detection (ZSD), the task that extends conventional detection models to detecting objects from unseen categories, has emerged as a new challenge in computer vision. Most existing approaches to ZSD are based on a strict mapping-transfer strategy that learns a mapping function from visual to semantic space over seen categories, then directly generalizes the learned mapping function to unseen object detection. However, the ZSD task remains challenging, since those works fail to consider the two key factors that hamper ZSD performance: (a) the domain shift problem between seen and unseen classes leads to poor transferability of the model; (b) the original visual feature space is suboptimal for ZSD since it lacks discriminative information. To alleviate these issues, we develop a novel Semantics-Guided Contrastive Network for ZSD (ContrastZSD), a detection framework that first brings the contrastive learning paradigm into the realm of ZSD. The pairwise contrastive tasks take advantage of class labels and semantic relations as additional supervision signals. Under the guidance of this explicit semantic supervision, the model can learn more knowledge about unseen categories to avoid over-fitting to the seen concepts.
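The contrastive ingredient can be sketched with a standard InfoNCE-style loss on normalized features. In ContrastZSD, the positive/negative pairing is guided by class labels and semantic relations, whereas the pairing below is hard-coded for illustration:

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss: pull the anchor toward its positive and push it
    away from the negatives, on L2-normalized feature vectors."""
    def norm(v):
        v = np.asarray(v, float)
        return v / np.linalg.norm(v)

    a = norm(anchor)
    sims = [a @ norm(positive)] + [a @ norm(n) for n in negatives]
    logits = np.array(sims) / tau
    logits -= logits.max()                       # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[0])                         # positive sits at index 0
```

Aligned anchor/positive pairs give a near-zero loss, while a mismatched positive with an anchor-like negative is heavily penalized.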
